Introduction to Programming with the MxNet Framework

林嶔 (Lin, Chin)

Lesson 5

Basic Introduction to MxNet (1)

– Through the derivations in the previous lessons, we should have realized that a development framework like this is not that hard to implement: no matter how deep or complex today's artificial neural networks are, they can always be defined as a composition of functions, and by differentiating each piece and applying the chain rule we can obtain the gradient of any parameter, which lets us train the network with gradient descent.
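
– As a quick reminder of that idea, here is a tiny sketch (a hypothetical plain-R example, not MxNet code): for the one-parameter loss \(f(w) = (wx - y)^2\), the chain rule gives \(df/dw = 2(wx - y)x\), and gradient descent repeatedly moves \(w\) against that gradient.

x <- 2; y <- 6; w <- 0; lr <- 0.05
for (i in 1:50) {
  grad <- 2 * (w * x - y) * x   # derivative obtained via the chain rule
  w <- w - lr * grad            # gradient descent update
}
print(w)                        # converges toward y / x = 3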

– Note, however, that MxNet can only be installed on 64-bit operating systems.

– Its installation procedure is somewhat unusual, and a GPU version is also available. Below is how to install the CPU version on Windows:

– If your R version is 3.5 or earlier, you may need to install it with the following commands:

cran <- getOption("repos")
cran["dmlc"] <- "https://s3-us-west-2.amazonaws.com/apache-mxnet/R/CRAN/"
options(repos = cran)
install.packages("mxnet")

– If your R version is 3.6 or later, you may need to install it with the following command:

install.packages("https://s3.ca-central-1.amazonaws.com/jeremiedb/share/mxnet/CPU/3.6/mxnet.zip", repos = NULL)

– Alternatively, for R 3.6 or later you can also try installing from the Apache MXNet repository:

cran <- getOption("repos")
cran["dmlc"] <- "https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/R/CRAN/"
options(repos = cran)
install.packages("mxnet")
library(mxnet)

Basic Introduction to MxNet (2)

  1. Data structure

  2. Prediction function (network structure)

  3. Optimization method

  1. Iterator

  2. Model architecture

  3. Optimizer

Writing the Iterator (1)

– Pay attention to the dimensions we define: in the MxNet framework, the last dimension is always the sample size.
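
– For example, a data frame in R stores samples in rows, so a transpose is needed before building such an array (a small illustration with hypothetical data; the real data below is built directly with rbind):

demo_df <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
demo_arr <- array(t(as.matrix(demo_df)), dim = c(2, 5))  # 2 features x 5 samples
dim(demo_arr)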

set.seed(0)

X1 = rnorm(1000) 
X2 = rnorm(1000) 
Y = X1 * 0.7 + X2 * 1.3 - 3.1 + rnorm(1000)

X.array = array(rbind(X1, X2), dim = c(2, 1000))
Y.array = array(Y, dim = c(1, 1000))
my_iterator_core = function(batch_size) {
  
  batch = 0
  batch_per_epoch = length(Y.array)/batch_size
  
  reset = function() {batch <<- 0}
  
  iter.next = function() {
    batch <<- batch+1
    if (batch > batch_per_epoch) {return(FALSE)} else {return(TRUE)}
  }
  
  value = function() {
    idx = 1:batch_size + (batch - 1) * batch_size
    idx[idx > ncol(X.array)] = sample(1:ncol(X.array), sum(idx > ncol(X.array)))
    data = mx.nd.array(X.array[,idx, drop=FALSE])
    label = mx.nd.array(Y.array[,idx, drop=FALSE])
    return(list(data = data, label = label))
  }
  
  return(list(reset = reset, iter.next = iter.next, value = value, batch_size = batch_size, batch = batch))
}
my_iterator_func <- setRefClass("Custom_Iter",
                                fields = c("iter", "batch_size"),
                                contains = "Rcpp_MXArrayDataIter",
                                methods = list(
                                  initialize = function(iter, batch_size = 100){
                                    .self$iter <- my_iterator_core(batch_size = batch_size)
                                    .self
                                  },
                                  value = function(){
                                    .self$iter$value()
                                  },
                                  iter.next = function(){
                                    .self$iter$iter.next()
                                  },
                                  reset = function(){
                                    .self$iter$reset()
                                  },
                                  finalize=function(){
                                  }
                                )
)

Writing the Iterator (2)

my_iter = my_iterator_func(iter = NULL, batch_size = 20)
  1. Reset the batch counter (switch to the next epoch):
my_iter$reset()
  2. Move to the next batch:
my_iter$iter.next()
## [1] TRUE
  3. Generate a mini-batch of samples:
my_iter$value()
## $data
##            [,1]       [,2]       [,3]      [,4]       [,5]        [,6]
## [1,]  1.2629542 -0.3262334  1.3297993  1.272429  0.4146414 -1.53995001
## [2,] -0.2868516  1.8411069 -0.1567643 -1.389803 -1.4731040 -0.06951893
##            [,7]       [,8]         [,9]     [,10]      [,11]      [,12]
## [1,] -0.9285671 -0.2947204 -0.005767173  2.404653  0.7635934 -0.7990093
## [2,]  0.2392414  0.2504199 -0.264423937 -1.975400 -0.4498760  0.9264656
##          [,13]      [,14]      [,15]      [,16]      [,17]      [,18]
## [1,] -1.147657 -0.2894616 -0.2992151 -0.4115108  0.2522235 -0.8919211
## [2,] -2.319971  0.6139336 -1.4731401 -0.2190492 -1.4034163  0.8206284
##           [,19]      [,20]
## [1,]  0.4356833 -1.2375385
## [2,] -0.5927263  0.4219688
## 
## $label
##           [,1]      [,2]      [,3]     [,4]      [,5]      [,6]     [,7]
## [1,] -2.144493 -0.922995 -2.382214 -4.31842 -4.232431 -4.871059 -4.12126
##           [,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
## [1,] -2.694021 -3.282202 -4.826817 -3.972437 -3.884252 -6.784581 -1.874525
##          [,15]     [,16]     [,17]     [,18]     [,19]     [,20]
## [1,] -4.103997 -2.750563 -2.639356 -1.861134 -6.239434 -3.645907

Writing the Model architecture

– The function "mx.symbol.Variable" creates a new starting point of the model; its name usually matches an output of the Iterator, and it can also be used to name a weight.

– The function "mx.symbol.FullyConnected" is a fully connected layer, equivalent to the linear function \(L\) used in the earlier MLP; its parameter "num.hidden" specifies the output dimension.

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
weight = mx.symbol.Variable(name = 'weight')
bias = mx.symbol.Variable(name = 'bias')
pred_y = mx.symbol.FullyConnected(data = data, weight = weight, bias = bias,
                                  num.hidden = 1, no.bias = FALSE, name = 'pred_y')
mx.symbol.infer.shape(pred_y, data = c(2, 20))
## $arg.shapes
## $arg.shapes$data
## [1]  2 20
## 
## $arg.shapes$weight
## [1] 2 1
## 
## $arg.shapes$bias
## [1] 1
## 
## 
## $out.shapes
## $out.shapes$pred_y_output
## [1]  1 20
## 
## 
## $aux.shapes
## named list()
residual = mx.symbol.broadcast_minus(lhs = label, rhs = pred_y) 
square_residual = mx.symbol.square(data = residual)
mean_square_residual = mx.symbol.mean(data = square_residual, axis = 0, keepdims = FALSE)
mse_loss = mx.symbol.MakeLoss(data = mean_square_residual, name = 'rmse')
mx.symbol.infer.shape(mse_loss, data = c(2, 20), label = c(1, 20))$out.shapes
## $rmse_output
## [1] 1

Writing the Optimizer

my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
  epsilon = 1e-08, wd = 0)
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

Running the Model (1)

– Note that num.round here refers to the number of epochs, not batches.
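
– A quick sanity check of the terminology (a sketch using the numbers from this example): with 1000 samples and array.batch.size = 20, one epoch contains 50 batches, so num.round = 10 corresponds to 500 parameter updates.

n_samples  <- 1000
batch_size <- 20
num_round  <- 10
batches_per_epoch <- n_samples / batch_size   # 50 batches per epoch
total_updates     <- num_round * batches_per_epoch
print(total_updates)                          # 500 updates in total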

mx.set.seed(0)

lr_model = mx.model.FeedForward.create(symbol = mse_loss, X = my_iter, optimizer = my_optimizer,
                                       array.batch.size = 20, ctx = mx.cpu(), num.round = 10)
lr_model$arg.params
## $weight
##           [,1]
## [1,] 0.6362522
## [2,] 1.3090633
## 
## $bias
## [1] -3.322609
lm(Y ~ X1 + X2)
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Coefficients:
## (Intercept)           X1           X2  
##     -3.0312       0.7117       1.3192

Running the Model (2)

my.eval.metric.loss <- mx.metric.custom(
  name = "MSE-loss", 
  function(real, pred) {
    return(as.array(pred))
  }
)

my_logger = mx.metric.logger$new()

lr_model = mx.model.FeedForward.create(symbol = mse_loss, X = my_iter, optimizer = my_optimizer,
                                       array.batch.size = 20, ctx = mx.cpu(), num.round = 10,
                                       eval.metric = my.eval.metric.loss,
                                       epoch.end.callback = mx.callback.log.train.metric(1, my_logger))
plot(my_logger$train, type = 'l', ylab = 'MSE loss')

Exercise 1: Use MxNet to implement logistic regression

– Here is a simulated logistic regression sample; try to write an MxNet program and solve for its weight parameters.

set.seed(0)

X1 = rnorm(1000) 
X2 = rnorm(1000)
LR = X1 * 2 + X2 * 3 + rnorm(1000)
PROP = 1/(1+exp(-LR))
Y = as.integer(PROP > runif(1000))

X.array = array(rbind(X1, X2), dim = c(2, 1000))
Y.array = array(Y, dim = c(1, 1000))

– This is the result computed by the function "glm":

glm(Y ~ X1 + X2, family = 'binomial')
## 
## Call:  glm(formula = Y ~ X1 + X2, family = "binomial")
## 
## Coefficients:
## (Intercept)           X1           X2  
##    -0.01483      1.91820      2.70523  
## 
## Degrees of Freedom: 999 Total (i.e. Null);  997 Residual
## Null Deviance:       1385 
## Residual Deviance: 654.9     AIC: 660.9

\[ \begin{align} log(\frac{{p}}{1-p}) & = b_{0} + b_{1}x \\ p & = \frac{{1}}{1+e^{-b_{0} - b_{1}x}} \\ loss & = CE(y, p) = -\frac{{1}}{n}\sum \limits_{i=1}^{n} \left(y_{i} \cdot log(p_{i}) + (1-y_{i}) \cdot log(1-p_{i})\right) \end{align} \]
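
– As a quick worked example of the loss above (a sketch, not part of the required MxNet solution), we can plug the coefficients reported by glm back into the cross-entropy formula:

b <- c(-0.01483, 1.91820, 2.70523)                    # (Intercept, X1, X2) from the glm output
p_hat <- 1 / (1 + exp(-(b[1] + b[2] * X1 + b[3] * X2)))
CE <- -mean(Y * log(p_hat) + (1 - Y) * log(1 - p_hat))
print(CE)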

Answer to Exercise 1

my_iterator_core = function(batch_size) {
  
  batch = 0
  batch_per_epoch = length(Y.array)/batch_size
  
  reset = function() {batch <<- 0}
  
  iter.next = function() {
    batch <<- batch+1
    if (batch > batch_per_epoch) {return(FALSE)} else {return(TRUE)}
  }
  
  value = function() {
    idx = 1:batch_size + (batch - 1) * batch_size
    idx[idx > ncol(X.array)] = sample(1:ncol(X.array), sum(idx > ncol(X.array)))
    data = mx.nd.array(X.array[,idx, drop=FALSE])
    label = mx.nd.array(Y.array[,idx, drop=FALSE])
    return(list(data = data, label = label))
  }
  
  return(list(reset = reset, iter.next = iter.next, value = value, batch_size = batch_size, batch = batch))
}

my_iterator_func <- setRefClass("Custom_Iter",
                                fields = c("iter", "batch_size"),
                                contains = "Rcpp_MXArrayDataIter",
                                methods = list(
                                  initialize = function(iter, batch_size = 100){
                                    .self$iter <- my_iterator_core(batch_size = batch_size)
                                    .self
                                  },
                                  value = function(){
                                    .self$iter$value()
                                  },
                                  iter.next = function(){
                                    .self$iter$iter.next()
                                  },
                                  reset = function(){
                                    .self$iter$reset()
                                  },
                                  finalize=function(){
                                  }
                                )
)

my_iter = my_iterator_func(iter = NULL, batch_size = 20)
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
linear_pred = mx.symbol.FullyConnected(data = data,
                                       num.hidden = 1, no.bias = FALSE, name = 'linear_pred')
logistic_pred = mx.symbol.sigmoid(data = linear_pred, name = 'logistic_pred')

eps = 1e-8
ce_loss_pos =  mx.symbol.broadcast_mul(mx.symbol.log(logistic_pred + eps), label)
ce_loss_neg =  mx.symbol.broadcast_mul(mx.symbol.log(1 - logistic_pred + eps), 1 - label)
ce_loss_mean = 0 - mx.symbol.mean(ce_loss_pos + ce_loss_neg)
ce_loss = mx.symbol.MakeLoss(ce_loss_mean, name = 'ce_loss')
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)
mx.set.seed(0)

logistic_model = mx.model.FeedForward.create(symbol = ce_loss, X = my_iter, optimizer = my_optimizer,
                                             array.batch.size = 20, ctx = mx.cpu(), num.round = 100)

logistic_model$arg.params
## $linear_pred_weight
##          [,1]
## [1,] 1.842254
## [2,] 2.765283
## 
## $linear_pred_bias
## [1] 0.0406286

MxNet's Low-Level Executor (1)

– In the previous exercise we already wrote the Iterator, Model architecture, and Optimizer for logistic regression; now let's write the Executor that ties the three together.

my_executor = mx.simple.bind(symbol = ce_loss,
                             data = c(2, 20), label = c(1, 20),
                             ctx = mx.cpu(), grad.req = "write")
mx.set.seed(0)

new_arg = mxnet:::mx.model.init.params(symbol = ce_loss,
                                       input.shape = list(data = c(2, 20), label = c(1, 20)),
                                       output.shape = NULL,
                                       initializer = mxnet:::mx.init.uniform(0.01),
                                       ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)

MxNet's Low-Level Executor (2)

– "my_iter$iter.next()" keeps returning TRUE until the current epoch is finished, and at the start of each epoch "my_iter$reset()" must be called to reset the batch counter.

my_iter$reset()

while (my_iter$iter.next()) {
  
  my_values <- my_iter$value()
  mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
  mx.exec.forward(my_executor, is.train = TRUE)
  mx.exec.backward(my_executor)
  update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
  mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
  
}
set.seed(0)

for (i in 1:100) {
  
  my_iter$reset()
  batch_loss = NULL
  
  while (my_iter$iter.next()) {
    
    my_values <- my_iter$value()
    mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
    mx.exec.forward(my_executor, is.train = TRUE)
    mx.exec.backward(my_executor)
    update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
    mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
    batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$ce_loss_output))
    
  }
  
  if (i %% 10 == 0 | i < 10) {
    cat(paste0("epoch = ", i,
               ": Cross-Entropy (CE) = ", formatC(mean(batch_loss), format = "f", 4),
               "; weight = ", paste(formatC(as.array(my_executor$ref.arg.arrays$linear_pred_weight), format = "f", 3), collapse = ", "),
               "; bias = ", formatC(as.array(my_executor$ref.arg.arrays$linear_pred_bias), format = "f", 3), "\n"))
  }
  
}
## epoch = 1: Cross-Entropy (CE) = 0.3328; weight = 1.681, 2.546; bias = 0.038
## epoch = 2: Cross-Entropy (CE) = 0.3305; weight = 1.763, 2.657; bias = 0.040
## epoch = 3: Cross-Entropy (CE) = 0.3301; weight = 1.802, 2.710; bias = 0.040
## epoch = 4: Cross-Entropy (CE) = 0.3300; weight = 1.821, 2.736; bias = 0.040
## epoch = 5: Cross-Entropy (CE) = 0.3300; weight = 1.831, 2.750; bias = 0.040
## epoch = 6: Cross-Entropy (CE) = 0.3300; weight = 1.836, 2.757; bias = 0.041
## epoch = 7: Cross-Entropy (CE) = 0.3300; weight = 1.839, 2.761; bias = 0.041
## epoch = 8: Cross-Entropy (CE) = 0.3300; weight = 1.841, 2.763; bias = 0.041
## epoch = 9: Cross-Entropy (CE) = 0.3300; weight = 1.841, 2.764; bias = 0.041
## epoch = 10: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 20: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 30: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 40: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 50: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 60: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 70: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 80: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 90: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041
## epoch = 100: Cross-Entropy (CE) = 0.3300; weight = 1.842, 2.765; bias = 0.041

MxNet's Low-Level Executor (3)

logistic_model <- mxnet:::mx.model.extract.model(symbol = logistic_pred,
                                                 train.execs = list(my_executor))
mx.model.save(logistic_model, "logistic_regression", iteration = 0)
logistic_model = mx.model.load("logistic_regression", iteration = 0)
predict_Y = predict(logistic_model, X.array, array.layout = "colmajor")
confusion_table = table(predict_Y > 0.5, Y)
print(confusion_table)
##        Y
##           0   1
##   FALSE 445  76
##   TRUE   74 405

MxNet's Low-Level Executor (4)

– Since MxNet's low-level executor can run the Forward/Backward procedure, it can of course also run the Forward procedure alone:

all_layers = logistic_model$symbol$get.internals()

linear_pred_output = all_layers$get.output(which(all_layers$outputs == 'linear_pred_output'))

symbol_group = mx.symbol.Group(c(linear_pred_output, logistic_model$symbol))
Pred_executor = mx.simple.bind(symbol = symbol_group, data = c(2, 1000), ctx = mx.cpu())

– The executor is now written; the remaining steps are almost the same as before, with only a few minor differences:

mx.exec.update.arg.arrays(Pred_executor, logistic_model$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(Pred_executor, logistic_model$aux.params, match.name = TRUE)
mx.exec.update.arg.arrays(Pred_executor, list(data = mx.nd.array(X.array)), match.name = TRUE)
mx.exec.forward(Pred_executor, is.train = FALSE)

PRED_linear_pred = as.array(Pred_executor$ref.outputs$linear_pred_output)
PRED_logistic_pred = as.array(Pred_executor$ref.outputs$logistic_pred_output)
confusion_table = table(PRED_logistic_pred > 0.5, Y)
print(confusion_table)
##        Y
##           0   1
##   FALSE 445  76
##   TRUE   74 405

Exercise 2: Use MxNet's low-level executor to make predictions on the IRIS dataset

(Figure F5_1)

data(iris)

X.array = array(t(as.matrix(iris[,-5])), dim = c(4, 150))
Y.array = array(t(model.matrix(~ -1 + iris[,5])), dim = c(3, 150))

set.seed(0)
TRAIN.seq = sample(1:150, 100)

TRAIN.X.array = X.array[,TRAIN.seq]
TRAIN.Y.array = Y.array[,TRAIN.seq]
TEST.X.array = X.array[,-TRAIN.seq]
TEST.Y.array = Y.array[,-TRAIN.seq]

\[ \begin{align} mlogloss(y, p) & = - \sum \limits_{j=1}^{m} y_{j} \log(p_{j}) \end{align} \]
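
– A small worked example of this loss (hypothetical numbers): with a one-hot label, the sum simply picks out the log-probability of the true class.

p_example <- c(0.7, 0.2, 0.1)     # predicted class probabilities for one sample
y_example <- c(1, 0, 0)           # one-hot label: the true class is class 1
mlogloss_example <- -sum(y_example * log(p_example))
print(mlogloss_example)           # equals -log(0.7)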

Answer to Exercise 2 (1)

  1. The Iterator part:
my_iterator_core = function(batch_size) {
  
  batch = 0
  batch_per_epoch = ncol(TRAIN.Y.array)/batch_size
  
  reset = function() {batch <<- 0}
  
  iter.next = function() {
    batch <<- batch+1
    if (batch > batch_per_epoch) {return(FALSE)} else {return(TRUE)}
  }
  
  value = function() {
    idx = 1:batch_size + (batch - 1) * batch_size
    idx[idx > ncol(TRAIN.Y.array)] = sample(1:ncol(TRAIN.Y.array), sum(idx > ncol(TRAIN.Y.array)))
    data = mx.nd.array(TRAIN.X.array[,idx, drop=FALSE])
    label = mx.nd.array(TRAIN.Y.array[,idx, drop=FALSE])
    return(list(data = data, label = label))
  }
  
  return(list(reset = reset, iter.next = iter.next, value = value, batch_size = batch_size, batch = batch))
}

my_iterator_func <- setRefClass("Custom_Iter",
                                fields = c("iter", "batch_size"),
                                contains = "Rcpp_MXArrayDataIter",
                                methods = list(
                                  initialize = function(iter, batch_size = 100){
                                    .self$iter <- my_iterator_core(batch_size = batch_size)
                                    .self
                                  },
                                  value = function(){
                                    .self$iter$value()
                                  },
                                  iter.next = function(){
                                    .self$iter$iter.next()
                                  },
                                  reset = function(){
                                    .self$iter$reset()
                                  },
                                  finalize=function(){
                                  }
                                )
)

my_iter = my_iterator_func(iter = NULL, batch_size = 20)
  2. The Model architecture part:
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
linear_pred = mx.symbol.FullyConnected(data = data, num.hidden = 3, name = 'linear_pred')
softmax_layer = mx.symbol.softmax(data = linear_pred, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
  3. The Optimizer part:
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

Answer to Exercise 2 (2)

#1. Build an executor to train model

my_executor = mx.simple.bind(symbol = m_logloss,
                             data = c(4, 20), label = c(3, 20),
                             ctx = mx.cpu(), grad.req = "write")

#2. Set the initial parameters
mx.set.seed(0)

new_arg = mxnet:::mx.model.init.params(symbol = m_logloss,
                                       input.shape = list(data = c(4, 20), label = c(3, 20)),
                                       output.shape = NULL,
                                       initializer = mxnet:::mx.init.uniform(0.01),
                                       ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)

#3. Define the updater
my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)
set.seed(0)

for (i in 1:200) {
  
  my_iter$reset()
  batch_loss = NULL
  
  while (my_iter$iter.next()) {
    
    my_values <- my_iter$value()
    mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
    mx.exec.forward(my_executor, is.train = TRUE)
    mx.exec.backward(my_executor)
    update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
    mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
    batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$m_logloss_output))
    
  }
  
  if (i %% 10 == 0 | i <= 5) {
    cat(paste0("epoch = ", i, ": m-logloss = ", formatC(mean(batch_loss), format = "f", 4), "\n"))
  }
  
}
## epoch = 1: m-logloss = 0.3504
## epoch = 2: m-logloss = 0.2999
## epoch = 3: m-logloss = 0.2367
## epoch = 4: m-logloss = 0.1970
## epoch = 5: m-logloss = 0.1736
## epoch = 10: m-logloss = 0.1293
## epoch = 20: m-logloss = 0.0983
## epoch = 30: m-logloss = 0.0822
## epoch = 40: m-logloss = 0.0719
## epoch = 50: m-logloss = 0.0648
## epoch = 60: m-logloss = 0.0595
## epoch = 70: m-logloss = 0.0554
## epoch = 80: m-logloss = 0.0521
## epoch = 90: m-logloss = 0.0494
## epoch = 100: m-logloss = 0.0471
## epoch = 110: m-logloss = 0.0452
## epoch = 120: m-logloss = 0.0436
## epoch = 130: m-logloss = 0.0421
## epoch = 140: m-logloss = 0.0408
## epoch = 150: m-logloss = 0.0397
## epoch = 160: m-logloss = 0.0387
## epoch = 170: m-logloss = 0.0378
## epoch = 180: m-logloss = 0.0369
## epoch = 190: m-logloss = 0.0362
## epoch = 200: m-logloss = 0.0355
softmax_model <- mxnet:::mx.model.extract.model(symbol = softmax_layer,
                                                train.execs = list(my_executor))

predict_Y = predict(softmax_model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  0
##   3  0  2 17

Writing a Multilayer Perceptron with MxNet (1)

– Let's try turning the softmax regression from Exercise 2 into a multilayer perceptron with one hidden layer. In fact, only the Model architecture part needs to be modified; nothing else has to change:

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
relu1 = mx.symbol.Activation(data = fc1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 3, name = 'fc2')
softmax_layer = mx.symbol.softmax(data = fc2, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')

Writing a Multilayer Perceptron with MxNet (2)

#1. Build an executor to train model

my_executor = mx.simple.bind(symbol = m_logloss,
                             data = c(4, 20), label = c(3, 20),
                             ctx = mx.cpu(), grad.req = "write")

#2. Set the initial parameters
mx.set.seed(0)
new_arg = mxnet:::mx.model.init.params(symbol = m_logloss,
                                       input.shape = list(data = c(4, 20), label = c(3, 20)),
                                       output.shape = NULL,
                                       initializer = mxnet:::mx.init.uniform(0.01),
                                       ctx = mx.cpu())
mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)

#3. Define the updater
my_updater = mx.opt.get.updater(optimizer = my_optimizer, weights = my_executor$ref.arg.arrays)
set.seed(0)

for (i in 1:200) {
  
  my_iter$reset()
  batch_loss = NULL
  
  while (my_iter$iter.next()) {
    
    my_values <- my_iter$value()
    mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
    mx.exec.forward(my_executor, is.train = TRUE)
    mx.exec.backward(my_executor)
    update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
    mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
    batch_loss = c(batch_loss, as.array(my_executor$ref.outputs$m_logloss_output))
    
  }
  
  if (i %% 10 == 0 | i <= 5) {
    message(paste0("epoch = ", i, ": m-logloss = ", formatC(mean(batch_loss), format = "f", 4)))
  }
  
}
softmax_model <- mxnet:::mx.model.extract.model(symbol = softmax_layer,
                                                train.execs = list(my_executor))

predict_Y = predict(softmax_model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  0
##   3  0  2 17

Writing a Multilayer Perceptron with MxNet (3)

my.model.FeedForward.create = function (Iterator, 
                                        loss_symbol, pred_symbol,
                                        Optimizer, num_round = 200) {
  
  #0. Check data shape
  Iterator$reset()
  Iterator$iter.next()
  my_values <- Iterator$value()
  input_shape <- lapply(my_values, dim)
  
  #1. Build an executor to train model
  exec_list = list(symbol = loss_symbol, ctx = mx.cpu(), grad.req = "write")
  exec_list = append(exec_list, input_shape)
  my_executor = do.call(mx.simple.bind, exec_list)
  
  #2. Set the initial parameters
  mx.set.seed(0)
  new_arg = mxnet:::mx.model.init.params(symbol = loss_symbol,
                                         input.shape = input_shape,
                                         output.shape = NULL,
                                         initializer = mxnet:::mx.init.uniform(0.01),
                                         ctx = mx.cpu())
  mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
  mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
  
  #3. Define the updater
  my_updater = mx.opt.get.updater(optimizer = Optimizer, weights = my_executor$ref.arg.arrays)
  
  #4. Forward/Backward
  set.seed(0)
  
  for (i in 1:num_round) {
    
    Iterator$reset()
    batch_loss = NULL
    
    while (Iterator$iter.next()) {
      
      my_values <- Iterator$value()
      mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
      mx.exec.forward(my_executor, is.train = TRUE)
      mx.exec.backward(my_executor)
      update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
      mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
      batch_loss = c(batch_loss, as.array(my_executor$ref.outputs[[1]]))
      
    }
    
    if (i %% 10 == 0 | i <= 5) {
      message(paste0("epoch = ", i, ": loss = ", formatC(mean(batch_loss), format = "f", 4)))
    }
    
  }
  
  #5. Get model
  my_model <- mxnet:::mx.model.extract.model(symbol = pred_symbol,
                                             train.execs = list(my_executor))
  
  return(my_model)
  
}
model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)
predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  0
##   3  0  2 17

Writing a Multilayer Perceptron with MxNet (4)

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
dp1 = mx.symbol.Dropout(data = fc1, p = 0.2, name = 'dp1')
relu1 = mx.symbol.Activation(data = dp1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 3, name = 'fc3')
softmax_layer = mx.symbol.softmax(data = fc2, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  0
##   3  0  2 17

– Are you starting to feel that MxNet can speed up your own AI development?

Exercise 3: Optimizing a Deep Network

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
relu1 = mx.symbol.Activation(data = fc1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
relu2 = mx.symbol.Activation(data = fc2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
relu3 = mx.symbol.Activation(data = fc3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 3, name = 'fc4')
softmax_layer = mx.symbol.softmax(data = fc4, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')
model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   2 18 15 17

Answer to Exercise 3

my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
                             epsilon = 1e-08, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 12  0
##   3  0  3 17

Visual Evidence for Exercise 3 (1)

my.model.FeedForward.create = function (Iterator, 
                                        loss_symbol, pred_symbol,
                                        Optimizer, num_round = 200) {
  
  require(abind)
  
  #0. Check data shape
  Iterator$reset()
  Iterator$iter.next()
  my_values <- Iterator$value()
  input_shape <- lapply(my_values, dim)
  batch_size <- tail(input_shape[[1]], 1)
  
  #1. Build an executor to train model
  exec_list = list(symbol = loss_symbol, ctx = mx.cpu(), grad.req = "write")
  exec_list = append(exec_list, input_shape)
  my_executor = do.call(mx.simple.bind, exec_list)
  
  #2. Set the initial parameters
  mx.set.seed(0)
  new_arg = mxnet:::mx.model.init.params(symbol = loss_symbol,
                                         input.shape = input_shape,
                                         output.shape = NULL,
                                         initializer = mxnet:::mx.init.uniform(0.01),
                                         ctx = mx.cpu())
  mx.exec.update.arg.arrays(my_executor, new_arg$arg.params, match.name = TRUE)
  mx.exec.update.aux.arrays(my_executor, new_arg$aux.params, match.name = TRUE)
  
  #3. Define the updater
  my_updater = mx.opt.get.updater(optimizer = Optimizer, weights = my_executor$ref.arg.arrays)
  
  #4. Forward/Backward
  message('Start training:')
  
  set.seed(0)
  epoch_grad = NULL
  
  for (i in 1:num_round) {
    
    Iterator$reset()
    batch_loss = list()
    batch_grad = list()
    batch_seq = 0
    t0 = Sys.time()
    
    while (Iterator$iter.next()) {
      
      my_values <- Iterator$value()
      mx.exec.update.arg.arrays(my_executor, arg.arrays = my_values, match.name = TRUE)
      mx.exec.forward(my_executor, is.train = TRUE)
      mx.exec.backward(my_executor)
      update_args = my_updater(weight = my_executor$ref.arg.arrays, grad = my_executor$ref.grad.arrays)
      mx.exec.update.arg.arrays(my_executor, update_args, skip.null = TRUE)
      batch_loss[[length(batch_loss) + 1]] = as.array(my_executor$ref.outputs[[1]])
      grad_list = sapply(my_executor$ref.grad.arrays, function (x) {if (!is.null(x)) {mean(abs(as.array(x)))}})
      grad_list = unlist(grad_list[grepl('weight', names(grad_list), fixed = TRUE)])
      batch_grad[[length(batch_grad) + 1]] = grad_list
      batch_seq = batch_seq + 1
      
    }
    
    if (i %% 10 == 0 | i <= 5) {
      message(paste0("epoch = ", i,
                     ": loss = ", formatC(mean(unlist(batch_loss)), format = "f", 4),
                     " (Speed: ", formatC(batch_seq * batch_size/as.numeric(Sys.time() - t0, units = 'secs'), format = "f", 2), " sample/secs)"))
    }
    
    epoch_grad = rbind(epoch_grad, apply(abind(batch_grad, along = 2), 1, mean))

  }
  
  epoch_grad[epoch_grad < 1e-8] = 1e-8
  
  COL = rainbow(ncol(epoch_grad))
  random_pos = 2^runif(ncol(epoch_grad), -0.5, 0.5)
  
  plot(epoch_grad[,1] * random_pos[1], type = 'l', col = COL[1],
       xlab = 'epoch', ylab = 'mean of abs(grad)', log = 'y',
       ylim = range(epoch_grad))
  
  for (i in 2:ncol(epoch_grad)) {lines(1:nrow(epoch_grad), epoch_grad[,i] * random_pos[i], col = COL[i])}
  
  legend('bottomright', paste0('fc', 1:ncol(epoch_grad), '_weight'), col = COL, lwd = 1)
  
  #5. Get model
  my_model <- mxnet:::mx.model.extract.model(symbol = pred_symbol,
                                             train.execs = list(my_executor))
  
  return(my_model)
  
}

Visual Evidence for Exercise 3 (2)

my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   2 18 15 17
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
                             epsilon = 1e-08, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 12  0
##   3  0  3 17

The Gradients Have Completely Vanished

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
relu1 = mx.symbol.Activation(data = fc1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
relu2 = mx.symbol.Activation(data = fc2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
relu3 = mx.symbol.Activation(data = fc3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 10, name = 'fc4')
relu4 = mx.symbol.Activation(data = fc4, act.type = 'relu', name = 'relu4')
fc5 = mx.symbol.FullyConnected(data = relu4, num.hidden = 10, name = 'fc5')
relu5 = mx.symbol.Activation(data = fc5, act.type = 'relu', name = 'relu5')
fc6 = mx.symbol.FullyConnected(data = relu5, num.hidden = 10, name = 'fc6')
relu6 = mx.symbol.Activation(data = fc6, act.type = 'relu', name = 'relu6')
fc7 = mx.symbol.FullyConnected(data = relu6, num.hidden = 10, name = 'fc7')
relu7 = mx.symbol.Activation(data = fc7, act.type = 'relu', name = 'relu7')
fc8 = mx.symbol.FullyConnected(data = relu7, num.hidden = 3, name = 'fc8')
softmax_layer = mx.symbol.softmax(data = fc8, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')

my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
                             epsilon = 1e-08, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   2 18 15 17

– Once the gradients have completely vanished, no amount of parameter tuning will ever solve this problem.

The Data Distribution Problem (1)

– In theory, during backpropagation the gradients become smaller and smaller the further we move away from the output layer. Let's look again at the gradient formulas of the multilayer perceptron:

\[ \begin{align} grad.o & = \frac{\partial}{\partial o}loss = \frac{o-y}{o(1-o)} \\ grad.l_2 & = \frac{\partial}{\partial l_2}loss = grad.o \otimes \frac{\partial}{\partial l_2}o= o-y \\ grad.W^2_1 & = \frac{\partial}{\partial W^2_1}loss = grad.l_2 \otimes \frac{\partial}{\partial W^2_1}l_2 = \frac{{1}}{n} \otimes (h_1^E)^T \bullet grad.l_2\\ grad.h_1^E & = \frac{\partial}{\partial h_1^E}loss = grad.l_2 \otimes \frac{\partial}{\partial h_1^E}l_2 = grad.l_2 \bullet (W^2_1)^T \\ grad.l_1 & = \frac{\partial}{\partial l_1}loss = grad.h_1 \otimes \frac{\partial}{\partial l_1}h_1 = grad.h_1 \otimes \frac{\partial}{\partial l_1}ReLU(l_1) \\ grad.W^1_d & = \frac{\partial}{\partial W^1_d}loss = grad.l_1 \otimes \frac{\partial}{\partial W^1_d}l_1 = \frac{{1}}{n} \otimes (x^E)^T \bullet grad.l_1 \end{align} \]
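
– The following is a minimal R transcription of these formulas (a sketch with hypothetical toy data; the superscript \(E\) means the matrix is extended with a column of 1s for the bias term):

set.seed(0)
n <- 10
x  <- matrix(rnorm(n * 2), nrow = n)              # n samples, 2 features
y  <- matrix(rbinom(n, 1, 0.5), nrow = n)
xE <- cbind(x, 1)
W1 <- matrix(rnorm(3 * 4, sd = 0.1), nrow = 3)    # (2 features + bias) x 4 hidden units
W2 <- matrix(rnorm(5 * 1, sd = 0.1), nrow = 5)    # (4 hidden units + bias) x 1 output

l1  <- xE %*% W1                                  # forward pass
h1  <- pmax(l1, 0)                                # ReLU
h1E <- cbind(h1, 1)
l2  <- h1E %*% W2
o   <- 1 / (1 + exp(-l2))                         # sigmoid output

grad.l2  <- o - y                                 # backward pass, following the formulas above
grad.W2  <- t(h1E) %*% grad.l2 / n
grad.h1E <- grad.l2 %*% t(W2)
grad.l1  <- grad.h1E[, 1:4] * (l1 > 0)            # drop the bias column, apply the ReLU derivative
grad.W1  <- t(xE) %*% grad.l1 / n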

The Data Distribution Problem (2)

– Consider that our randomly initialized weights are mostly close to 0, so if the input values vary over a very large range, the gradients will fluctuate wildly.

TRAIN.X.array[3:4,] = TRAIN.X.array[3:4,] * 1000
TEST.X.array[3:4,] = TEST.X.array[3:4,] * 1000
data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
dp1 = mx.symbol.Dropout(data = fc1, p = 0.2, name = 'dp1')
relu1 = mx.symbol.Activation(data = dp1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 3, name = 'fc3')
softmax_layer = mx.symbol.softmax(data = fc2, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')

my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
                             epsilon = 1e-08, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  4
##   3  0  2 13
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   2 18 15 17

The Data Distribution Problem (3)

data(iris)

X.array = array(t(as.matrix(iris[,-5])), dim = c(4, 150))
for (i in 1:4) {X.array[i,] = (X.array[i,] - mean(X.array[i,]))/sd(X.array[i,])}
Y.array = array(t(model.matrix(~ -1 + iris[,5])), dim = c(3, 150))

set.seed(0)
TRAIN.seq = sample(1:150, 100)

TRAIN.X.array = X.array[,TRAIN.seq]
TRAIN.Y.array = Y.array[,TRAIN.seq]
TEST.X.array = X.array[,-TRAIN.seq]
TEST.Y.array = Y.array[,-TRAIN.seq]
my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  0
##   3  0  2 17
my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
                             epsilon = 1e-08, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 14  0
##   3  0  1 17

The Data Distribution Problem (4)

– This technique is called "batch normalization". The idea was first proposed by two Google researchers, Sergey Ioffe and Christian Szegedy, in their 2015 paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

(Figure F5_2)

The Data Distribution Problem (5)

\[ \begin{align} \hat{x_i} & = \frac{x_i - \bar{x}}{\sqrt{\sigma^2_{x} + \epsilon}} \\ y_i = BatchNorm(x_i) & = \hat{x_i} \times \gamma \ + \beta \\\\ \bar{x} & = \frac{1}{n} \sum\limits_{i=1}^{n} x_i \\ \sigma^2_{x} & = \frac{1}{n} \sum\limits_{i=1}^{n} (x_i - \bar{x})^2 \end{align} \]

– Here \(\epsilon\) is a very small number (to avoid division by 0), \(\bar{x}\) and \(\sigma^2_{x}\) are the mean and variance of \(x\), and \(\gamma\) and \(\beta\) are two linear transformation terms, which make batch normalization a reversible process (if \(\gamma = \sqrt{\sigma^2_{x} + \epsilon}\) and \(\beta = \bar{x}\), the original input is recovered).

data(iris)

demo_X.array = array(t(as.matrix(iris[,-5])), dim = c(4, 150))
bn_X.array = demo_X.array
for (i in 1:4) {
  bn_mean = mean(demo_X.array[i,])
  bn_var = var(demo_X.array[i,])
  eps = 1e-3
  gamma = 1
  beta = 0
  bn_X.array[i,] = (demo_X.array[i,] - bn_mean) /  sqrt(bn_var + eps) * gamma + beta
}
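
– A quick check of the reversibility claim above (a sketch reusing demo_X.array): setting \(\gamma = \sqrt{\sigma^2_{x} + \epsilon}\) and \(\beta = \bar{x}\) recovers the original values.

i <- 1
bn_mean <- mean(demo_X.array[i,])
bn_var  <- var(demo_X.array[i,])
eps     <- 1e-3
gamma   <- sqrt(bn_var + eps)
beta    <- bn_mean
recovered <- (demo_X.array[i,] - bn_mean) / sqrt(bn_var + eps) * gamma + beta
max(abs(recovered - demo_X.array[i,]))   # essentially 0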

The Data Distribution Problem (6)

– Assume that when backpropagation reaches \(BatchNorm\) a gradient \(grad.y\) already exists, and continue the derivation from there (derivation omitted):

\[\begin{align} \frac{\partial y}{\partial \beta} & = \frac{1}{n} \sum\limits_{i=1}^{n} grad.y_i \\ \frac{\partial y}{\partial \gamma} & = \frac{1}{n} \sum\limits_{i=1}^{n} grad.y_i \times \hat{x_i} \\\\ \frac{\partial y}{\partial \hat{x}} & = grad.y \otimes \gamma \\ \frac{\partial y}{\partial \sigma^2_{x}} & = - \frac{1} {2} \sum\limits_{i=1}^{n} \gamma (x_i - \bar{x}) (\sigma^2_{x} + \epsilon)^{-1.5} grad.y_i \\ \frac{\partial y}{\partial \bar{x}} & = \sum\limits_{i=1}^{n} \frac {- grad.y_i \times \gamma} {\sqrt{\sigma^2_{x} + \epsilon}} + \frac{\partial y}{\partial \sigma^2_{x}} \times \frac {-2 \sum\limits_{i=1}^{n} (x_i - \bar{x}) } {n} \\\\ \frac{\partial y}{\partial x} & = \frac{\partial y}{\partial \hat{x}} \otimes \frac {1} {\sqrt{\sigma^2_{x} + \epsilon}} \oplus \frac{\partial y}{\partial \sigma^2_{x}} \otimes \frac {2(x_i - \bar{x})} {n} \oplus \frac{\partial y}{\partial \bar{x}} \otimes \frac {1} {n}\end{align}\]
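
– A direct R transcription of these formulas (a sketch assuming a vector x and an incoming gradient grad_y are already given; the function and variable names are hypothetical):

bn_backward <- function(x, grad_y, gamma, eps = 1e-3) {
  n     <- length(x)
  x_bar <- mean(x)
  var_x <- mean((x - x_bar)^2)
  x_hat <- (x - x_bar) / sqrt(var_x + eps)

  grad_beta  <- mean(grad_y)                                               # formula for beta
  grad_gamma <- mean(grad_y * x_hat)                                       # formula for gamma
  grad_xhat  <- grad_y * gamma
  grad_var   <- -0.5 * sum(gamma * (x - x_bar) * (var_x + eps)^(-1.5) * grad_y)
  grad_mean  <- sum(-grad_y * gamma / sqrt(var_x + eps)) + grad_var * (-2 * sum(x - x_bar) / n)
  grad_x     <- grad_xhat / sqrt(var_x + eps) + grad_var * 2 * (x - x_bar) / n + grad_mean / n

  list(grad_x = grad_x, grad_gamma = grad_gamma, grad_beta = grad_beta)
}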

The Data Distribution Problem (7)

– With MxNet's help, implementing batch normalization is as easy as implementing Dropout!

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
bn1 = mx.symbol.BatchNorm(data = fc1, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn1')
relu1 = mx.symbol.Activation(data = bn1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
bn2 = mx.symbol.BatchNorm(data = fc2, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn2')
relu2 = mx.symbol.Activation(data = bn2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
bn3 = mx.symbol.BatchNorm(data = fc3, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn3')
relu3 = mx.symbol.Activation(data = bn3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 3, name = 'fc4')
softmax_layer = mx.symbol.softmax(data = fc4, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')

my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 14  1
##   3  0  1 16

Exercise 4: Reproduce the inference process of an MLP that uses BN

PARAMS = model$arg.params
MEANS = model$aux.params
ls(PARAMS)
##  [1] "bn1_beta"   "bn1_gamma"  "bn2_beta"   "bn2_gamma"  "bn3_beta"  
##  [6] "bn3_gamma"  "fc1_bias"   "fc1_weight" "fc2_bias"   "fc2_weight"
## [11] "fc3_bias"   "fc3_weight" "fc4_bias"   "fc4_weight"
ls(MEANS)
## [1] "bn1_moving_mean" "bn1_moving_var"  "bn2_moving_mean" "bn2_moving_var" 
## [5] "bn3_moving_mean" "bn3_moving_var"
Input = TEST.X.array[,1]
dim(Input) = c(4, 1)
preds = predict(model, Input, array.layout = "colmajor")
print(preds)
##              [,1]
## [1,] 9.999359e-01
## [2,] 2.815460e-05
## [3,] 3.597806e-05

Answer to Exercise 4

PARAMS = model$arg.params
MEANS = model$aux.params

Input = TEST.X.array[,1]
dim(Input) = c(4, 1)

bn_eps = 1e-3

fc1_out = t(Input) %*% as.array(PARAMS$fc1_weight) + as.array(PARAMS$fc1_bias)
bn1_out = (fc1_out - as.array(MEANS$bn1_moving_mean)) / sqrt(as.array(MEANS$bn1_moving_var) + bn_eps) * as.array(PARAMS$bn1_gamma) + as.array(PARAMS$bn1_beta)
relu1_out = bn1_out
relu1_out[relu1_out < 0] = 0

fc2_out = relu1_out %*% as.array(PARAMS$fc2_weight) + as.array(PARAMS$fc2_bias)
bn2_out = (fc2_out - as.array(MEANS$bn2_moving_mean)) / sqrt(as.array(MEANS$bn2_moving_var) + bn_eps) * as.array(PARAMS$bn2_gamma) + as.array(PARAMS$bn2_beta)
relu2_out = bn2_out
relu2_out[relu2_out < 0] = 0

fc3_out = relu2_out %*% as.array(PARAMS$fc3_weight) + as.array(PARAMS$fc3_bias)
bn3_out = (fc3_out - as.array(MEANS$bn3_moving_mean)) / sqrt(as.array(MEANS$bn3_moving_var) + bn_eps) * as.array(PARAMS$bn3_gamma) + as.array(PARAMS$bn3_beta)
relu3_out = bn3_out
relu3_out[relu3_out < 0] = 0

fc4_out = relu3_out %*% as.array(PARAMS$fc4_weight) + as.array(PARAMS$fc4_bias)

Softmax_out = exp(fc4_out)/sum(exp(fc4_out))
cbind(t(Softmax_out), preds)
##              [,1]         [,2]
## [1,] 9.999359e-01 9.999359e-01
## [2,] 2.815463e-05 2.815460e-05
## [3,] 3.597807e-05 3.597806e-05

The Advantage of BN for Optimizing Deep Networks (1)

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')
fc1 = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = 'fc1')
bn1 = mx.symbol.BatchNorm(data = fc1, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn1')
relu1 = mx.symbol.Activation(data = bn1, act.type = 'relu', name = 'relu1')
fc2 = mx.symbol.FullyConnected(data = relu1, num.hidden = 10, name = 'fc2')
bn2 = mx.symbol.BatchNorm(data = fc2, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn2')
relu2 = mx.symbol.Activation(data = bn2, act.type = 'relu', name = 'relu2')
fc3 = mx.symbol.FullyConnected(data = relu2, num.hidden = 10, name = 'fc3')
bn3 = mx.symbol.BatchNorm(data = fc3, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn3')
relu3 = mx.symbol.Activation(data = bn3, act.type = 'relu', name = 'relu3')
fc4 = mx.symbol.FullyConnected(data = relu3, num.hidden = 10, name = 'fc4')
bn4 = mx.symbol.BatchNorm(data = fc4, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn4')
relu4 = mx.symbol.Activation(data = bn4, act.type = 'relu', name = 'relu4')
fc5 = mx.symbol.FullyConnected(data = relu4, num.hidden = 10, name = 'fc5')
bn5 = mx.symbol.BatchNorm(data = fc5, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn5')
relu5 = mx.symbol.Activation(data = bn5, act.type = 'relu', name = 'relu5')
fc6 = mx.symbol.FullyConnected(data = relu5, num.hidden = 10, name = 'fc6')
bn6 = mx.symbol.BatchNorm(data = fc6, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn6')
relu6 = mx.symbol.Activation(data = bn6, act.type = 'relu', name = 'relu6')
fc7 = mx.symbol.FullyConnected(data = relu6, num.hidden = 10, name = 'fc7')
bn7 = mx.symbol.BatchNorm(data = fc7, axis = 1, eps = 1e-3, fix.gamma = TRUE, name = 'bn7')
relu7 = mx.symbol.Activation(data = bn7, act.type = 'relu', name = 'relu7')
fc8 = mx.symbol.FullyConnected(data = relu7, num.hidden = 3, name = 'fc8')
softmax_layer = mx.symbol.softmax(data = fc8, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')

my_optimizer = mx.opt.create(name = "sgd", learning.rate = 0.05, momentum = 0.9, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   1 18  0  0
##   2  0 13  0
##   3  0  2 17

The Advantage of BN for Optimizing Deep Networks (2)

data = mx.symbol.Variable(name = 'data')
label = mx.symbol.Variable(name = 'label')

for (i in 1:30) {
  if (i == 1) {
    fc = mx.symbol.FullyConnected(data = data, num.hidden = 10, name = paste0('fc', i))
  } else {
    fc = mx.symbol.FullyConnected(data = relu, num.hidden = 10, name = paste0('fc', i))
  }
  bn = mx.symbol.BatchNorm(data = fc, axis = 1, name = paste0('bn', i))
  relu = mx.symbol.Activation(data = bn, act.type = 'relu', name = paste0('relu', i))
}

fc_final = mx.symbol.FullyConnected(data = relu, num.hidden = 3, name = 'fc_final')
softmax_layer = mx.symbol.softmax(data = fc_final, axis = 1, name = 'softmax_layer')

eps = 1e-8
m_log = 0 - mx.symbol.mean(mx.symbol.broadcast_mul(mx.symbol.log(softmax_layer + eps), label))
m_logloss = mx.symbol.MakeLoss(m_log, name = 'm_logloss')

my_optimizer = mx.opt.create(name = "adam", learning.rate = 0.001, beta1 = 0.9, beta2 = 0.999,
                             epsilon = 1e-08, wd = 0)

model = my.model.FeedForward.create(Iterator = my_iter,
                                    loss_symbol = m_logloss, pred_symbol = softmax_layer,
                                    Optimizer = my_optimizer, num_round = 200)

predict_Y = predict(model, TEST.X.array, array.layout = "colmajor")
confusion_table = table(max.col(t(predict_Y)), max.col(t(TEST.Y.array)))
print(confusion_table)
##    
##      1  2  3
##   2 18 15 17

– The root of this problem is that the weights are initialized completely at random, so you can imagine that the forward pass keeps losing information, and by the time we reach 30 or more layers the signal is practically indistinguishable from 0. Backpropagation then has to push gradient updates back through these meaningless errors, and the optimization of the whole network fails.
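
– A minimal sketch of this effect (hypothetical plain-R code): when every layer is initialized with weights drawn from a narrow uniform distribution, in the spirit of mx.init.uniform(0.01), the forward activations shrink toward 0 layer by layer.

set.seed(0)
h <- matrix(rnorm(20 * 4), nrow = 20)                          # 20 samples, 4 features
for (i in 1:30) {
  W <- matrix(runif(ncol(h) * 10, -0.01, 0.01), ncol = 10)     # small random weights
  h <- pmax(h %*% W, 0)                                        # linear layer + ReLU
  if (i %in% c(1, 5, 10, 30)) {
    cat(paste0("layer ", i, ": mean(|h|) = ",
               formatC(mean(abs(h)), format = "e", digits = 2), "\n"))
  }
}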

Conclusion

– So far, the only tools we have against the vanishing gradient problem are Adam, which we learned in Lesson 3, and Batch Normalization. Adam's advantage is that even when the scales of different layers differ greatly, it can still balance the update speed of each layer's parameters. Batch Normalization, in turn, reduces the loss of gradient during backpropagation. However, if the gradient becomes exactly 0, none of these methods will help.

– However, since 2012 the hot spot of AI research has been computer vision, so most of the breakthrough studies also come from that field. We have not yet formally introduced the special developments of AI in computer vision, so in the next lesson we will first fill in that background, and then, building on it, see how the most advanced research effectively solves the vanishing gradient problem!